Table of contents

  1. Exploratory data analysis
    1.1 Revenue decomposition by day of the week
    1.2 Revenue decomposition by type of day and season
    1.3 Trends
    1.4 HCF impact on sales

  2. Regression analysis - HCF
    2.1 HCF impact - Item A
    2.2 HCF impact - Item B

  3. Regression analysis - sales
    3.1 Understanding Item A sales
    3.2 Understanding Item B sales

    Appendix 1 - Multicollinearity checks
    Appendix 2 - Residual plots

1. Exploratory data analysis

1.1 Revenue decomposition by day of the week

display(HTML(sales_dist_div))
display(HTML(order_dist_div))
display(HTML(spo_dist_div))

Analysis

  • Weekends show the highest overall demand, but also exhibit the greatest fluctuations in sales (greater spread in boxplots).

  • Differences in sales between weekends and weekdays appear to be driven by variations in customer traffic, as the distribution of sales per order remains consistent across the week.

  • Weekdays demand is relatively stable, with overlapping sales distributions across Mondays to Thursdays.

  • Sales on holidays are comparable to Fridays, suggesting strong demand.

  • Wednesday holidays (e.g., Christmas Day and New Year’s Day) saw higher sales per order but fewer total orders, possibly reflecting larger group purchases or celebrations.


1.2 Revenue decomposition by type of day and season

Back to top

rev_decomp_fig.show()

Analysis

  • Summer revenue was driven by a high number of orders, however sales per order remained comparable to other the seasons.

  • Fall and Winter saw similar customer traffic, but Winter had higher average sales per order.

  • In Spring, the number of orders increased notably compared to Winter, though sales per order remained relatively stable.

  • Across all seasons except Summer, holidays experienced higher average sales per order when compared to weekends and Fridays.

  • Weekends consistently recorded higher sales per order than Fridays, regardless of season.

Recommendations


Statistical tests on recommendations

A. Is the average number of orders higher during the Spring/Summer compared to the Fall/Winter?


\[\begin{aligned} H_0 &: \mu_{\text{ Spring/Summer}} = \mu_{\text{ Fall/Winter}} \quad \text{(No difference in average orders)} \\ H_1 &: \mu_{\text{ Spring/Summer}} > \mu_{\text{ Fall/Winter}} \quad \text{(Spring/Summer has higher average orders)} \end{aligned}\]


order_warm = combined_df[(combined_df['season']=='Spring') | (combined_df['season']=='Summer')]['in_store_orders']
order_cold = combined_df[(combined_df['season']=='Fall') | (combined_df['season']=='Winter')]['in_store_orders']
# test for equal variance, a ratio under 2 (or above 0.5) is considered acceptable
(order_warm.describe()['std'] / order_cold.describe()['std']).round(2)
1.22
# perform two-sample t-test with equal variance
statistic, p_value = ttest_ind(order_warm, order_cold, alternative='greater') 

print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')
Test statistic: 8.984
p-value: 0.000

There is statistical evidence suggesting the average number of orders is higher during the Spring/Summer compared to the Fall/Winter.


B. Is the average sales per order higher during holiday’s compared to Fridays and Weekends (non-Summer seasons)?


\[\begin{aligned} H_0 &: \mu_{\text{ holidays}} = \mu_{\text{ weekends}} \quad \text{(No difference in average sales per order)} \\ H_1 &: \mu_{\text{ holidays}} > \mu_{\text{ weekends}} \quad \text{(Holidays have higher average sales per order)} \end{aligned}\]


Note: We will focus our test on comparing holidays and weekends. Since weekends already exhibit higher average sales per order than Fridays, a significant difference between holidays and weekends would also imply holidays differ sigificantly from Fridays.

spo_holiday = combined_df[(combined_df['type_of_day']=='Holiday') & (combined_df['season']!='Summer')]['sales_per_order']
spo_weekend = combined_df[(combined_df['type_of_day']=='Weekend') & (combined_df['season']!='Summer')]['sales_per_order']
# test for equal variance, a ratio under 2 is considered acceptable
(spo_holiday.describe()['std'] / spo_weekend.describe()['std']).round(2)
1.31
# since our sample for holiday's is small, let's perform a Shapiro-Wilk test to check for normality
statistic, p_value = shapiro(spo_holiday)

print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')
if p_value < 0.05:
    print('The data is not normally distributed.')
else:
    print('The data is normally distributed.')
Test statistic: 0.896
p-value: 0.164
The data is normally distributed.
# perform two-sample t-test 
statistic, p_value = ttest_ind(spo_holiday, spo_weekend, alternative='greater') 

print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')
Test statistic: 1.797
p-value: 0.038

There is statistical evidence suggesting the average sales per order is higher on holidays than weekends.


Recommendations


1.4 HCF impact on sales

Back to top

As HCF happened during the Winter, we’ll compare sales during the HCF days and non-HCF days in the Winter only for a more accurate comparison.

HCF_A_fig.show()
HCF_B_fig.show()

Analysis

  • HCF did not appear to have a strong effect on item A sales. Most days exhibited overlapping distributions, with Saturdays being the only exception.

  • HCF seems to have suppresssed item B sales, with three of the days exhibiting notably lower sales.

To supplement the graphical analysis above, we’ll perform a regression analysis.


2. Regression analysis - HCF

Back to top

2.1 HCF impact - Item A

Back to top

hcf_inference_df = inference_df.copy()
hcf_inference_df = hcf_inference_df[hcf_inference_df['season']=='Winter']
hcf_A_df = hcf_inference_df[['item_A_sales', 'rain', 'snow', 'is_long_weekend', 'is_holiday', 'day_of_week', 'is_HCF']]
formula = (
    """
        item_A_sales ~ 
        + rain 
        + snow 
        + is_long_weekend
        + C(is_holiday)
        + C(day_of_week) 
        + C(is_HCF)
    """
    )

hcf_A_model = smf.ols(formula, data=hcf_A_df).fit()

print(hcf_A_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           item_A_sales   R-squared:                       0.791
Model:                            OLS   Adj. R-squared:                  0.761
Method:                 Least Squares   F-statistic:                     26.79
Date:                Wed, 13 Aug 2025   Prob (F-statistic):           4.32e-22
Time:                        23:18:23   Log-Likelihood:                -602.93
No. Observations:                  90   AIC:                             1230.
Df Residuals:                      78   BIC:                             1260.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                    1718.3903     63.038     27.260      0.000    1592.891    1843.889
is_long_weekend[T.True]       103.4549    134.401      0.770      0.444    -164.116     371.026
C(is_holiday)[T.True]         825.4585    124.029      6.655      0.000     578.535    1072.382
C(day_of_week)[T.Tuesday]     -87.6350     84.179     -1.041      0.301    -255.223      79.953
C(day_of_week)[T.Wednesday]    56.9839     85.123      0.669      0.505    -112.483     226.451
C(day_of_week)[T.Thursday]     12.3040     84.429      0.146      0.885    -155.781     180.389
C(day_of_week)[T.Friday]      442.7401     85.866      5.156      0.000     271.793     613.687
C(day_of_week)[T.Saturday]    801.0451     86.876      9.221      0.000     628.089     974.001
C(day_of_week)[T.Sunday]      831.5385     83.651      9.941      0.000     665.002     998.075
C(is_HCF)[T.True]              83.0554     53.258      1.559      0.123     -22.973     189.084
rain                           -2.1743      4.138     -0.525      0.601     -10.413       6.065
snow                          -36.6697     24.328     -1.507      0.136     -85.103      11.763
==============================================================================
Omnibus:                       20.793   Durbin-Watson:                   2.276
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               29.960
Skew:                           1.027   Prob(JB):                     3.12e-07
Kurtosis:                       4.942   Cond. No.                         53.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Analysis

  • The coefficient for C(is_HCF)[T.True] is not statistically different from 0, indicating there is no strong evidence that item A sales during HCF differed from sales during non-HCF periods, after controlling for other variables. The positive coefficient could be interpreted as weak evidence that HCF did drive some item A, but not by a significant amount.

  • Item A sales during the Winter was still primarily explained by the weekly seasonality and holiday effects.


2.2 HCF impact - Item B

Back to top

hcf_b_df = hcf_inference_df[['item_B_sales', 'rain', 'snow', 'is_long_weekend', 'is_holiday', 'day_of_week', 'is_HCF']]
formula = (
    """
        item_B_sales ~ 
        + rain 
        + snow 
        + is_long_weekend
        + C(is_holiday)
        + C(day_of_week) 
        + C(is_HCF)
    """
    )

hcf_B_model = smf.ols(formula, data=hcf_b_df).fit()

print(hcf_B_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           item_B_sales   R-squared:                       0.740
Model:                            OLS   Adj. R-squared:                  0.703
Method:                 Least Squares   F-statistic:                     20.17
Date:                Wed, 13 Aug 2025   Prob (F-statistic):           1.57e-18
Time:                        23:18:23   Log-Likelihood:                -606.43
No. Observations:                  90   AIC:                             1237.
Df Residuals:                      78   BIC:                             1267.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                    1665.4529     65.538     25.412      0.000    1534.977    1795.929
is_long_weekend[T.True]       419.5671    139.730      3.003      0.004     141.385     697.749
C(is_holiday)[T.True]         415.9336    128.948      3.226      0.002     159.218     672.649
C(day_of_week)[T.Tuesday]    -139.2911     87.517     -1.592      0.116    -313.525      34.942
C(day_of_week)[T.Wednesday]  -133.0181     88.499     -1.503      0.137    -309.206      43.170
C(day_of_week)[T.Thursday]    -48.4912     87.777     -0.552      0.582    -223.242     126.259
C(day_of_week)[T.Friday]      421.4689     89.271      4.721      0.000     243.743     599.195
C(day_of_week)[T.Saturday]    680.5938     90.321      7.535      0.000     500.779     860.409
C(day_of_week)[T.Sunday]      528.6963     86.968      6.079      0.000     355.555     701.837
C(is_HCF)[T.True]             -25.1586     55.370     -0.454      0.651    -135.392      85.075
rain                           -0.8709      4.303     -0.202      0.840      -9.437       7.695
snow                          -59.4070     25.293     -2.349      0.021    -109.761      -9.053
==============================================================================
Omnibus:                        0.036   Durbin-Watson:                   1.592
Prob(Omnibus):                  0.982   Jarque-Bera (JB):                0.191
Skew:                           0.002   Prob(JB):                        0.909
Kurtosis:                       2.774   Cond. No.                         53.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Analysis

  • Similar to item A, the coefficient for C(is_HCF)[T.True] is not statistically different from 0, indicating there is no strong evidence that item B sales during HCF differed from sales during non-HCF periods, after controlling for other variables.

  • Item B sales during the Winter were also primarily explained by the weekly seasonality and holiday effects, but compared to item A, long weekends and weather conditions also contributed.


3. Regression analysis - sales

Back to top

3.1 Understanding Item A sales

Back to top

formula = (
    """
        log_A ~ 
        + rain 
        + snow 
        + is_long_weekend
        + C(is_holiday)
        + C(season)
        + C(day_of_week) 
    """
    )

A_model = smf.ols(formula, data=A_sales_df).fit()

print(A_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  log_A   R-squared:                       0.765
Model:                            OLS   Adj. R-squared:                  0.755
Method:                 Least Squares   F-statistic:                     81.22
Date:                Wed, 13 Aug 2025   Prob (F-statistic):           1.04e-93
Time:                        23:18:23   Log-Likelihood:                 296.79
No. Observations:                 339   AIC:                            -565.6
Df Residuals:                     325   BIC:                            -512.0
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       7.4555      0.019    395.924      0.000       7.418       7.493
is_long_weekend[T.True]         0.0023      0.026      0.089      0.929      -0.049       0.054
C(is_holiday)[T.True]           0.3111      0.035      8.773      0.000       0.241       0.381
C(season)[T.Spring]             0.0926      0.017      5.528      0.000       0.060       0.126
C(season)[T.Summer]             0.1608      0.016     10.344      0.000       0.130       0.191
C(season)[T.Fall]              -0.0198      0.016     -1.265      0.207      -0.051       0.011
C(day_of_week)[T.Tuesday]       0.0078      0.022      0.358      0.720      -0.035       0.051
C(day_of_week)[T.Wednesday]     0.0307      0.022      1.412      0.159      -0.012       0.073
C(day_of_week)[T.Thursday]      0.0130      0.022      0.598      0.550      -0.030       0.056
C(day_of_week)[T.Friday]        0.2332      0.022     10.661      0.000       0.190       0.276
C(day_of_week)[T.Saturday]      0.4024      0.022     18.119      0.000       0.359       0.446
C(day_of_week)[T.Sunday]        0.3527      0.022     16.002      0.000       0.309       0.396
rain                           -0.0012      0.001     -1.377      0.169      -0.003       0.000
snow                           -0.0119      0.010     -1.158      0.248      -0.032       0.008
==============================================================================
Omnibus:                       24.661   Durbin-Watson:                   1.235
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               37.402
Skew:                           0.500   Prob(JB):                     7.56e-09
Kurtosis:                       4.283   Cond. No.                         64.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Analysis

  • Item A sales are not strongly correlated with weather conditions (rain or snow), indicating that customer behavior is relatively unaffected by these factors.

  • Variations in item A sales are primarily driven by seasonality — both in terms of the day of the week and the actual season — as well as whether the day is a holiday.

Model intepretation

exp_coef = pd.DataFrame({
    'features': A_model.params.index, 
    'coeff': np.exp(A_model.params.values).round(2),
    'p-value': A_model.pvalues.values.round(3)
})
exp_coef.set_index('features', inplace=True)
exp_coef
coeff p-value
features
Intercept 1729.27 0.000
is_long_weekend[T.True] 1.00 0.929
C(is_holiday)[T.True] 1.36 0.000
C(season)[T.Spring] 1.10 0.000
C(season)[T.Summer] 1.17 0.000
C(season)[T.Fall] 0.98 0.207
C(day_of_week)[T.Tuesday] 1.01 0.720
C(day_of_week)[T.Wednesday] 1.03 0.159
C(day_of_week)[T.Thursday] 1.01 0.550
C(day_of_week)[T.Friday] 1.26 0.000
C(day_of_week)[T.Saturday] 1.50 0.000
C(day_of_week)[T.Sunday] 1.42 0.000
rain 1.00 0.169
snow 0.99 0.248

Base case: Average item A sales = $1729.27 on a regular Monday in Winter with no rain and no snow.

is_long_weekend - No significant impact on sales.

is_holiday - All else being equal, on average, sales during on a holiday is roughly 36% higher than the base case.

season - No significant difference in item A sales between Winter and Fall.

  • Compared to Winter, and all else being equal:
    • Spring sales are 10% higher
    • Summer sales are 17% higher

day_of_week - No significant difference in item A sales between the weekdays (Mon - Thurs).

  • Relative to Monday, and all else being equal:
    • Friday sales are 26% higher
    • Saturday sales are 50% higher
    • Sunday sales are 42% higher

rain - No significant impact on sales.

snow - No significant impact on sales.

For example:

Average item A sales on a Friday in the Summer with no rain and no snow = $1729.27 x 1.26 x 1.17 = $2549.28


3.2 Understanding Item B sales

Back to top

B_sales_df = inference_df[['log_B', 'hours_opened', 'avg_temperature', 'rain', 'snow',
       'is_long_weekend', 'HCF_sales', 'is_holiday', 'season', 'day_of_week']]
formula = (
    """
        log_B ~ 
        + rain 
        + snow 
        + is_long_weekend
        + C(is_holiday)
        + C(season)
        + C(day_of_week) 
    """
    )

B_model = smf.ols(formula, data=B_sales_df).fit()

print(B_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  log_B   R-squared:                       0.848
Model:                            OLS   Adj. R-squared:                  0.842
Method:                 Least Squares   F-statistic:                     140.0
Date:                Wed, 13 Aug 2025   Prob (F-statistic):          1.59e-124
Time:                        23:18:23   Log-Likelihood:                 259.21
No. Observations:                 339   AIC:                            -490.4
Df Residuals:                     325   BIC:                            -436.9
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       7.3495      0.021    349.337      0.000       7.308       7.391
is_long_weekend[T.True]         0.0977      0.029      3.319      0.001       0.040       0.156
C(is_holiday)[T.True]           0.2691      0.040      6.792      0.000       0.191       0.347
C(season)[T.Spring]             0.1546      0.019      8.261      0.000       0.118       0.191
C(season)[T.Summer]             0.4366      0.017     25.145      0.000       0.402       0.471
C(season)[T.Fall]               0.0472      0.017      2.703      0.007       0.013       0.082
C(day_of_week)[T.Tuesday]      -0.0057      0.024     -0.234      0.815      -0.054       0.042
C(day_of_week)[T.Wednesday]     0.0287      0.024      1.185      0.237      -0.019       0.076
C(day_of_week)[T.Thursday]      0.0447      0.024      1.842      0.066      -0.003       0.092
C(day_of_week)[T.Friday]        0.3131      0.024     12.807      0.000       0.265       0.361
C(day_of_week)[T.Saturday]      0.4406      0.025     17.760      0.000       0.392       0.489
C(day_of_week)[T.Sunday]        0.3849      0.025     15.633      0.000       0.336       0.433
rain                           -0.0064      0.001     -6.821      0.000      -0.008      -0.005
snow                           -0.0341      0.012     -2.966      0.003      -0.057      -0.011
==============================================================================
Omnibus:                        5.666   Durbin-Watson:                   1.408
Prob(Omnibus):                  0.059   Jarque-Bera (JB):                7.178
Skew:                          -0.127   Prob(JB):                       0.0276
Kurtosis:                       3.666   Cond. No.                         64.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Analysis

  • Item B sales is particularly weak during the Winter.

  • Item B sales are more sensitive to weather conditions, weekly seasonality and long weekends.

Model interpretation

exp_coef = pd.DataFrame({
    'features': B_model.params.index,
    'coeff': np.exp(B_model.params.values).round(2),
    'p-value': B_model.pvalues.values.round(3)
})
exp_coef.set_index('features', inplace=True)
exp_coef
coeff p-value
features
Intercept 1555.37 0.000
is_long_weekend[T.True] 1.10 0.001
C(is_holiday)[T.True] 1.31 0.000
C(season)[T.Spring] 1.17 0.000
C(season)[T.Summer] 1.55 0.000
C(season)[T.Fall] 1.05 0.007
C(day_of_week)[T.Tuesday] 0.99 0.815
C(day_of_week)[T.Wednesday] 1.03 0.237
C(day_of_week)[T.Thursday] 1.05 0.066
C(day_of_week)[T.Friday] 1.37 0.000
C(day_of_week)[T.Saturday] 1.55 0.000
C(day_of_week)[T.Sunday] 1.47 0.000
rain 0.99 0.000
snow 0.97 0.003

Base case: Average item B sales = $1555.37 on a regular Monday in Winter with no rain and no snow.

is_long_weekend - All else being equal, on average, sales are approximately 10% higher on long weekends compared to regular weekends.

is_holiday - All else being equal, on average, sales during on a holiday is roughly 31% higher than the base case.

season - Compared to Winter, and all else being equal: - Spring sales are 17% higher - Summer sales are 55% higher - Fall sales are 5% higher

day_of_week - No significant difference in item B sales between the weekdays (Mon - Thurs).

  • Relative to Monday, and all else being equal:
    • Friday sales are 37% higher
    • Saturday sales are 55% higher
    • Sunday sales are 47% higher

rain - Each additional 1mm of rainfall is associated with a 1% decrease in item B sales, all else being equal.

snow - Each additional 1mm of snowfall is associated with a 3% decrease in item B sales, all else being equal.

For example:

Average item B sales on a Friday in the Fall with 3mm of rain and no snow = $1555.37 x 1.37 x 1.05 x 0.99^3 = $2170.95


Understanding item C sales

C_sales_df = inference_df[['log_C', 'hours_opened', 'avg_temperature', 'rain', 'snow',
       'is_long_weekend', 'HCF_sales', 'is_holiday', 'season', 'day_of_week']]
formula = (
    """
        log_C ~ 
        + rain 
        + snow 
        + is_long_weekend
        + C(is_holiday)
        + C(season)
        + C(day_of_week) 
    """
    )

C_model = smf.ols(formula, data=C_sales_df).fit()

print(C_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  log_C   R-squared:                       0.605
Model:                            OLS   Adj. R-squared:                  0.590
Method:                 Least Squares   F-statistic:                     38.36
Date:                Wed, 13 Aug 2025   Prob (F-statistic):           8.52e-58
Time:                        23:18:23   Log-Likelihood:                 468.52
No. Observations:                 339   AIC:                            -909.0
Df Residuals:                     325   BIC:                            -855.5
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       7.0347      0.011    619.988      0.000       7.012       7.057
is_long_weekend[T.True]         0.0153      0.016      0.966      0.335      -0.016       0.047
C(is_holiday)[T.True]           0.0949      0.021      4.443      0.000       0.053       0.137
C(season)[T.Spring]             0.0757      0.010      7.501      0.000       0.056       0.096
C(season)[T.Summer]             0.0924      0.009      9.869      0.000       0.074       0.111
C(season)[T.Fall]               0.0074      0.009      0.784      0.434      -0.011       0.026
C(day_of_week)[T.Tuesday]      -0.0084      0.013     -0.641      0.522      -0.034       0.017
C(day_of_week)[T.Wednesday]     0.0131      0.013      1.005      0.316      -0.013       0.039
C(day_of_week)[T.Thursday]      0.0136      0.013      1.038      0.300      -0.012       0.039
C(day_of_week)[T.Friday]        0.0864      0.013      6.552      0.000       0.060       0.112
C(day_of_week)[T.Saturday]      0.1502      0.013     11.226      0.000       0.124       0.177
C(day_of_week)[T.Sunday]        0.1238      0.013      9.326      0.000       0.098       0.150
rain                           -0.0006      0.001     -1.221      0.223      -0.002       0.000
snow                           -0.0131      0.006     -2.108      0.036      -0.025      -0.001
==============================================================================
Omnibus:                       11.795   Durbin-Watson:                   1.258
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               13.369
Skew:                           0.359   Prob(JB):                      0.00125
Kurtosis:                       3.657   Cond. No.                         64.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
exp_coef = pd.DataFrame({
    'features': C_model.params.index,
    'coeff': np.exp(C_model.params.values).round(2),
    'p-values': C_model.pvalues.values.round(3)
})

exp_coef.set_index('features', inplace=True)
exp_coef
coeff p-values
features
Intercept 1135.40 0.000
is_long_weekend[T.True] 1.02 0.335
C(is_holiday)[T.True] 1.10 0.000
C(season)[T.Spring] 1.08 0.000
C(season)[T.Summer] 1.10 0.000
C(season)[T.Fall] 1.01 0.434
C(day_of_week)[T.Tuesday] 0.99 0.522
C(day_of_week)[T.Wednesday] 1.01 0.316
C(day_of_week)[T.Thursday] 1.01 0.300
C(day_of_week)[T.Friday] 1.09 0.000
C(day_of_week)[T.Saturday] 1.16 0.000
C(day_of_week)[T.Sunday] 1.13 0.000
rain 1.00 0.223
snow 0.99 0.036

Base case: Average item C sales = $1135.40 on a regular Monday in Winter with no rain and no snow.

is_long_weekend - No significant impact on sales.

is_holiday - All else being equal, on average, sales during on a holiday is roughly 10% higher than the base case.

season - No significant difference in item C sales between Winter and Fall.

  • Compared to Winter, and all else being equal:
    • Spring sales are 8% higher
    • Summer sales are 10% higher

day_of_week - No significant difference in item C sales between the weekdays (Mon - Thurs).

  • Relative to Monday, and all else being equal:
    • Friday sales are 9% higher
    • Saturday sales are 16% higher
    • Sunday sales are 13% higher

rain - No significant impact on sales.

snow - Each additional 1mm of snowfall is associated with a 1% decrease in item C sales, all else being equal.

For example:

Average item C sales on a Sunday in the Spring with no snow = $1135.40 x 1.13 x 1.08 = $1385.64


Appendix 1 - multicollinearity checks

Back to top

HCF item A / item B models

# check for multicollinearity with VIF - rule of thumb VIF < 5 is good.
X = pd.get_dummies(hcf_A_df, drop_first=True)
X = X.drop(columns=['item_A_sales'])
vif = [variance_inflation_factor(X.values, index) for index, value in enumerate(X.columns)] 

vif_df = pd.DataFrame({
    'Features': X.columns,
    'VIF': vif
})

vif_df
Features VIF
0 rain 1.685095
1 snow 1.172261
2 is_long_weekend_True 1.184665
3 is_holiday_True 1.375031
4 day_of_week_Tuesday 1.169418
5 day_of_week_Wednesday 1.241626
6 day_of_week_Thursday 1.178908
7 day_of_week_Friday 1.258241
8 day_of_week_Saturday 1.258572
9 day_of_week_Sunday 1.186628
10 is_HCF_True 1.656635
  • No signs of multicollinearity among the existing variables.

Item A / item B sales models

# check for multicollinearity with VIF - rule of thumb VIF < 5 is good.

X = pd.get_dummies(A_sales_df, drop_first=True)
X = X.drop(columns=['log_A'])
vif = [variance_inflation_factor(X, index) for index, value in enumerate(X.columns)]
vif_df = pd.DataFrame({
    'Features': X.columns,
    'VIF': vif
})

vif_df
Features VIF
0 hours_opened 17.307752
1 avg_temperature 17.779757
2 rain 1.357277
3 snow 1.128092
4 HCF_sales 1.876263
5 is_long_weekend_True 1.644316
6 is_holiday_True 1.542749
7 season_Spring 2.402694
8 season_Summer 7.137513
9 season_Fall 3.488629
10 day_of_week_Tuesday 2.138480
11 day_of_week_Wednesday 2.159496
12 day_of_week_Thursday 2.149954
13 day_of_week_Friday 2.368185
14 day_of_week_Saturday 2.467691
15 day_of_week_Sunday 2.357782
  • The results indicate there is multicollinearity between hours_opened, avg_temperature and season_Summer. Since hours_opened is largely explained by the day of the week and avg_temperature is correlated with seasonal effects, we will drop these two variables to reduce redundancy and improve model stability.
# check for multicollinearity again after dropping 'hours_opened' and 'avg_temperature'

X = pd.get_dummies(A_sales_df, drop_first=True)
X = X.drop(columns=['log_A', 'hours_opened', 'avg_temperature'])
vif = [variance_inflation_factor(X, index) for index, value in enumerate(X.columns)]
vif_df = pd.DataFrame({
    'Features': X.columns,
    'VIF': vif
})

vif_df
Features VIF
0 rain 1.327447
1 snow 1.115561
2 HCF_sales 1.363251
3 is_long_weekend_True 1.634335
4 is_holiday_True 1.467373
5 season_Spring 1.646864
6 season_Summer 1.853490
7 season_Fall 1.911962
8 day_of_week_Tuesday 1.439529
9 day_of_week_Wednesday 1.457562
10 day_of_week_Thursday 1.431270
11 day_of_week_Friday 1.460691
12 day_of_week_Saturday 1.612651
13 day_of_week_Sunday 1.537483
  • There are no longer signs of multicollinearity after dropping hours_opened and avg_temperature.

Appendix 2: regression models - residual plots

Back to top

HCF - item A model

hcf_A_residuals = hcf_A_model.resid
hcf_A_fitted_values = hcf_A_model.fittedvalues

hcf_A_scatter = px.scatter(x=hcf_A_fitted_values, y=hcf_A_residuals)

hcf_A_scatter.add_hline(
    y=0,
    line_dash='dash',
    line_color='red'
)

hcf_A_scatter.update_layout(
    title='Fitted values vs residuals',
    yaxis=dict(
        title='Residuals'
    ),
    xaxis=dict(
        title='Fitted values'
    ),
    width=600,
    height=400
)

hcf_A_scatter.show()

There is no discernable pattern in the residuals.

HCF - item B model

hcf_B_residuals = hcf_B_model.resid
hcf_B_fitted_values = hcf_B_model.fittedvalues

hcf_B_scatter = px.scatter(x=hcf_B_fitted_values, y=hcf_B_residuals)

hcf_B_scatter.add_hline(
    y=0,
    line_dash='dash',
    line_color='red'
)

hcf_B_scatter.update_layout(
    title='Fitted values vs residuals',
    yaxis=dict(
        title='Residuals'
    ),
    xaxis=dict(
        title='Fitted values'
    ),
    width=600,
    height=400
)

hcf_B_scatter.show()

There is no discernable pattern in the residuals.

Item A model

A_residuals = A_model.resid
A_fitted_values = A_model.fittedvalues

A_scatter = px.scatter(x=A_fitted_values, y=A_residuals)

A_scatter.add_hline(
    y=0,
    line_dash='dash',
    line_color='red'
)

A_scatter.update_layout(
    title='Fitted values vs residuals',
    yaxis=dict(
        title='Residuals'
    ),
    xaxis=dict(
        title='Fitted values'
    ),
    width=600,
    height=400
)

A_scatter.show()

There is no discernable pattern in the residuals.

Item B model

B_residuals = B_model.resid
B_fitted_values = B_model.fittedvalues

B_scatter = px.scatter(x=B_fitted_values, y=B_residuals)

B_scatter.add_hline(
    y=0,
    line_dash='dash',
    line_color='red'
)

B_scatter.update_layout(
    title='Fitted values vs residuals',
    yaxis=dict(
        title='Residuals'
    ),
    xaxis=dict(
        title='Fitted values'
    ),
    width=600,
    height=400
)

B_scatter.show()

A couple bigger residuals for smaller fitted values, but overall no clear pattern.

Item C model

C_residuals = C_model.resid
C_fitted_values = C_model.fittedvalues

C_scatter = px.scatter(x=C_fitted_values, y=C_residuals)

C_scatter.add_hline(
    y=0,
    line_dash='dash',
    line_color='red'
)

C_scatter.update_layout(
    title='Fitted values vs residuals',
    yaxis=dict(
        title='Residuals'
    ),
    xaxis=dict(
        title='Fitted values'
    ),
    width=600,
    height=400
)

C_scatter.show()

A couple bigger residuals for smaller fitted values, but overall no clear pattern.